Abstract:

The introduction of high-speed analog-to-digital converters has moved many of the
traditional front-end and sub-array combining functions of multi-function, phased-array radar
systems from the analog domain into the digital domain. Because of the intense
processing required, many of these functions had to be realized in hardware.
This was originally accomplished using VLSI ASICs. However, the advent of multi-million-gate
field-programmable gate arrays (FPGAs) has permitted these complex digital processing functions
to be put in small packages with a degree of design flexibility that is normally associated only
with software. This permits more of the radar functions to be realized in commercial
off-the-shelf (COTS) hardware by obviating the need for full-custom VLSI in many cases.

The incorporation of FPGA technology into COTS processing subsystems permits more complex
designs to be created than could be achieved by general-purpose or digital signal processors
alone. Simply incorporating FPGAs into single-board computers could solve many signal
processing problems. However, because of the complexity of the signal processing in a
multi-function radar system, a distributed, parallel-processing architecture is usually required. In
addition, the trend toward an increasing number of input channels and the formation of a greater
number of simultaneous beams requires a high degree of interconnection among the processing
elements. Therefore, the technology used to interconnect the computing elements must be
flexible enough to accommodate different architectures and system requirements. Furthermore,
the interconnection technology should be scalable enough to enable early design prototyping as
well as system deployment over a wide range of mission platforms.
Example Interconnection of Radar Front-End Processing


This paper focuses on the impact of using a heterogeneous distributed computing system for
digital beamforming in a multi-function radar system. The interconnection of FPGAs requires
balancing the utilization of FPGA resources for endpoint logic I/O with that for processing
requirements. A balance must also be struck in the mapping of functions between the FPGAs and
the programmable processors in a heterogeneous system. Scaling a particular requirement
frequently forces the interconnection topology to change rather than simply scale.
We examine several different sets of requirements, their mapping to the
heterogeneous computing platforms, and the tradeoffs involved. Particular focus is given to the
changes in functional allocation and the resulting system topologies.
Beamforming Radar System Architecture
Processing  Resources 
Strawman  System  Analysis 

♦  Front-End  Processing 

♦  Back-End  Processing 

♦  Beamformer  Architectures 

Summary 
Beamforming requires massive dataflow and computation

♦ ADC precision and data rate are chosen to provide high dynamic range and
wide signal bandwidth

♦  High  number  of  input  channels  required  in  modern  phased  array  radars  to 
produce  multiple  beams  and  nulls 
Microprocessors

♦  Fixed  processing,  I/O,  and  memory  architecture 

♦  Task  context  switch  requires  microseconds 

♦  Native  floating-point  available 

♦  Low  interaction  between  code  modules 

FPGAs 

♦  Customizable  processing,  I/O,  and  memory  architecture 

♦  Task  context  switch  requires  reconfiguration  --  milliseconds 

♦  Floating-point  must  be  built  or  bought 

♦  Considerable  interaction  between  IP  cores 

♦  Signal  propagation  issues 

♦  Currently  harder  to  program  than  microprocessors 


©  2003  Mercury  Computer  Systems,  Inc. 




400  - 1000  MHz  clock  speeds 

133  MHz  system  bus  (MPC74xx)  --  851  MB/s 

64-bit  integer  and  floating-point  units 

128-bit  AltiVec  vector  processing  unit 

Pipelined  instruction  unit 

32  kB  instruction  and  data  caches 

Up  to  2  MB  L2  cache 
Clock speeds lower than processors: 100 - 200 MHz clocks
Up to 20 full-duplex multi-gigabit transceivers

Many  DSP  supporting  features 
Each block RAM contains two banks with independent sets of address and data lines
Gigabit transceivers provide over 240 MBps each direction -- over 4800 MBps aggregate throughput across 20 transceivers!
• Lots of channels -- 80+ input channels

•  ADC  with  “good”  bandwidth  and  dynamic  range 

♦  100  MSps  --  1.56  -  25  MHz  bandwidth  using  fs/4  sampling 

♦  14-bit  precision  --  over  80  dB  dynamic  range 

•  Reasonable  implementation  risk  -- 100  MHz  clock 


ANALOG/DIGITAL  CONVERTER  DATA  RATES 


INPUT  CHANNELS  PER  FPGA  USING  GIGABIT  Tx/Rx 
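
The strawman numbers above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes the textbook ideal-quantizer formula (6.02 N + 1.76 dB for a full-scale sine), unpadded 14-bit samples, and the 240 MBps per-direction transceiver rate quoted earlier. The function names are hypothetical.

```python
def ideal_adc_snr_db(bits):
    """Ideal quantization-limited SNR of an N-bit ADC (full-scale sine input)."""
    return 6.02 * bits + 1.76


def adc_rate_mbytes_per_s(sample_rate_msps, bits):
    """Raw ADC output rate in MB/s, assuming samples are packed with no padding."""
    return sample_rate_msps * bits / 8.0


snr_db = ideal_adc_snr_db(14)                 # ideal figure; "over 80 dB" leaves margin
rate = adc_rate_mbytes_per_s(100, 14)         # one 100 MSps, 14-bit channel

LINK_MBYTES_PER_S = 240                       # per-direction rate quoted per transceiver
TRANSCEIVERS = 20                             # transceivers per FPGA quoted in the slides

channels_per_link = int(LINK_MBYTES_PER_S // rate)
channels_per_fpga = TRANSCEIVERS * channels_per_link

print(f"ideal SNR: {snr_db:.1f} dB, channel rate: {rate:.0f} MB/s")
print(f"{channels_per_link} channel per link, {channels_per_fpga} channels per FPGA")
```

Under these assumptions one raw ADC channel fits on each transceiver, which is consistent with the figure of up to 20 100-MSps channels per FPGA.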
ADC precision and rate and number of channels drive downstream requirements
Maximum Input Channels


Digital  Down  Converter 

♦ fs/4 IF & BW

♦ 4x decimation

♦ 31-tap complex FIR, real symmetric coefficients


Eliminates  the  need  for  numerically 
controlled  oscillators  (NCO) 


Usually  no  bit  growth 
Lowpass Decimation Filter

♦ 1x (bypass), 2x, 4x, 8x, and 16x decimation rates

♦  0, 16,  32,  64, 128  taps 

♦  Real  coefficients 

♦  0  to  2  bits  of  bit  growth 




Equalizer 

♦  16-tap,  complex  coefficients  --  cannot  generally  exploit 
symmetry 

♦  Usually  no  bit  growth 


•  Reduce  complexity  --  exploit  fs/4  center  frequency  and  bandwidth 

♦  Complex  mixing  reduces  to  polyphase  commutation 
•  Cosine  and  sine  select  even  and  odd  samples  respectively 

-  cos(2πn/4) = 1, 0, -1, 0, 1, ...;  j sin(2πn/4) = 0, j, 0, -j, 0, ...

♦  Exploit  polyphase  structure  for  decimation 
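
The reduction of complex mixing to commutation can be illustrated with a short sketch. It assumes a down-conversion sign convention (multiplication by exp(-j 2πn/4)); the slides do not state the sign, and the function names are hypothetical. The point is only that the fs/4 local oscillator takes the values 1, -j, -1, +j, so no multiplier or NCO is needed.

```python
import cmath
import math


def fs4_mix_direct(x):
    """Reference: multiply real samples by exp(-j*2*pi*n/4) using a real oscillator."""
    return [xn * cmath.exp(-1j * math.pi * n / 2) for n, xn in enumerate(x)]


def fs4_mix_commutated(x):
    """Same result with no multiplies: route and negate samples in a 4-phase cycle."""
    out = []
    for n, xn in enumerate(x):
        phase = n % 4
        if phase == 0:
            out.append(complex(xn, 0))     # even sample -> +I
        elif phase == 1:
            out.append(complex(0, -xn))    # odd sample  -> -Q
        elif phase == 2:
            out.append(complex(-xn, 0))    # even sample -> -I
        else:
            out.append(complex(0, xn))     # odd sample  -> +Q
    return out


samples = [0.5, -1.0, 2.0, 3.0, -4.0, 1.5, 0.25, -0.75]
direct = fs4_mix_direct(samples)
commutated = fs4_mix_commutated(samples)
max_error = max(abs(a - b) for a, b in zip(direct, commutated))
print(f"max difference between direct and commutated mixing: {max_error:.2e}")
```

Even-indexed samples feed only the in-phase path and odd-indexed samples feed only the quadrature path, which is what lets the following decimating filter be split into polyphase branches.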
Odd number of taps creates symmetries in the FIR coefficients
Reduce complexity -- exploit filter symmetries


Exploit  symmetric  filter  structures  for  in-phase  signal 
Exploit symmetry pair filters for quadrature signal
Each tap calculation involves one coefficient and two samples
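
A minimal sketch of the "one coefficient and two samples" idea follows: for a symmetric odd-length FIR, the two samples that share a coefficient are added first and then multiplied once. This is a generic folded-FIR illustration with hypothetical function names, not the FPGA implementation from the slides.

```python
def fir_direct(h, x):
    """Reference FIR: y[n] = sum_k h[k] * x[n-k], with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_folded(h, x):
    """Symmetric odd-length FIR using a pre-adder: one multiply per coefficient pair."""
    n_taps = len(h)
    assert n_taps % 2 == 1 and list(h) == list(h)[::-1], "needs odd, symmetric taps"
    mid = n_taps // 2

    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    y = []
    for n in range(len(x)):
        acc = h[mid] * sample(n - mid)                 # unpaired center tap
        for k in range(mid):
            # The two samples that share coefficient h[k] are added before multiplying.
            acc += h[k] * (sample(n - k) + sample(n - (n_taps - 1 - k)))
        y.append(acc)
    return y


def multiplies_per_output(n_taps):
    """Multiplies per output sample for the folded structure."""
    return n_taps // 2 + 1


taps = [1, -2, 3, 5, 3, -2, 1]
data = [4, -1, 0, 2, 7, -3, 5, 1, 0, 6]
print(fir_folded(taps, data) == fir_direct(taps, data))
print(multiplies_per_output(31), "multiplies per output for 31 taps instead of 31")
```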
Reduce complexity -- exploit 4x decimation

♦  Use  MAC-Engine  to  do  4  multiplies  per  input  sample 

•  Use fclk = 4 x fs to time share multipliers

♦  Configure  logic  slices  as  shift  registers  (SRL’s)  to  save  BRAM 

•  Need  to  store  3  sets  of  numbers  --  need  2  BRAM’s 

-  Save  BRAM  by  using  logic  slices  to  store  both  sets  of  samples 
Reduce complexity -- use MAC-Engine FIR implementation

♦  Run  multipliers  at  4x  sample  rate  --  time  share  multipliers 

♦ Exploit constant ratio of filter length to decimation rate (16/2 = 32/4 = 64/8 = 128/16 = 8)

•  Single  structure  handles  multiple  filter  implementations 

•  Single  clock  frequency 

♦ Use dual-bank feature of BRAM

•  First  bank  stores  samples 
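
The reason a single MAC-Engine structure can serve every lowpass filter mode is that the work per input sample does not change with the mode. The sketch below works this out from the tap counts and decimation rates listed for the lowpass decimation filter. The split into two real streams (I and Q filtered separately with real coefficients) is an assumption made for illustration.

```python
# (decimation rate, taps) pairs listed for the lowpass decimation filter
lpf_modes = [(2, 16), (4, 32), (8, 64), (16, 128)]


def macs_per_input_sample(decimation, taps):
    """A decimating FIR computes `taps` MACs once every `decimation` input samples."""
    return taps / decimation


loads = [macs_per_input_sample(d, t) for d, t in lpf_modes]

CLOCKS_PER_SAMPLE = 4   # multipliers run at 4x the sample rate (from the slides)
REAL_STREAMS = 2        # assumption: I and Q are filtered as two real streams

multipliers = REAL_STREAMS * loads[0] / CLOCKS_PER_SAMPLE

print("MACs per input sample for each mode:", loads)
print("time-shared multipliers per channel:", multipliers)
```

Under these assumptions every mode needs 8 MACs per input sample per stream, and time sharing at four clocks per sample gives 4 multipliers per channel, which agrees with the LPF figure quoted in the utilization summary.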
Reduce complexity -- reduce number of multipliers and BRAM’s

♦ Exploit fclk/fs -- use MAC-Engine

♦  Implement  complex  multiply  using  only  3  MAC-Engines 

•  Use  common  product  term  in  complex  multiply 


y_r = x_r h_r - x_i h_i

y_i = x_r h_i + x_i h_r

y_r = (x_r - x_i) h_r + x_i (h_r - h_i)

y_i = x_r (h_i + h_r) - (x_r - x_i) h_r


Trade  logic  slices  for  multipliers 


Trade  logic  slices  for  block  RAM 
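
The three-multiplier complex multiply can be written out directly. In this sketch the shared product is (x_r - x_i) h_r, and the coefficient terms (h_r - h_i) and (h_i + h_r) are assumed to be precomputed when the filter is loaded. The function name is hypothetical.

```python
def complex_multiply_3(xr, xi, hr, hi):
    """Return (yr, yi) for (xr + j*xi) * (hr + j*hi) using three real multiplies."""
    common = (xr - xi) * hr            # multiply 1: product shared by both outputs
    yr = common + xi * (hr - hi)       # multiply 2
    yi = xr * (hi + hr) - common       # multiply 3
    return yr, yi


# Check against Python's built-in complex arithmetic (four multiplies internally).
yr, yi = complex_multiply_3(3, 4, 5, -2)
reference = complex(3, 4) * complex(5, -2)
print((yr, yi), reference)
```

The saving of one multiplier is paid for with extra adders, and the pre-added terms can be one bit wider than their operands, which matters when inputs are already close to the multiplier width.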
FPGA features can be exploited to maximize utilization

Up to 20 100-MSps channels per FPGA

♦ DDC with 31-tap FIR using only 3 multipliers/channel

♦ LPF 16-128 tap decimating FIR using only 4 multipliers/channel

♦ EQU 16-tap complex FIR using only 12 multipliers/channel
Digital Receiver Module for 20x 100 MSps Channels on Virtex-II Pro 100
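
A rough multiplier budget for the 20-channel receiver module can be tallied from the per-channel figures above. The device total of 444 embedded multipliers is an assumption taken from memory of the Virtex-II Pro 100 data sheet, not from the slides, and should be checked against the data sheet before being relied on.

```python
multipliers_per_channel = {"DDC": 3, "LPF": 4, "EQU": 12}   # from the slides
CHANNELS = 20                                                # from the slides
ASSUMED_DEVICE_MULTIPLIERS = 444   # assumption: embedded 18x18 multipliers on the device

per_channel_total = sum(multipliers_per_channel.values())
module_total = CHANNELS * per_channel_total
utilization = module_total / ASSUMED_DEVICE_MULTIPLIERS

print(f"{per_channel_total} multipliers per channel, {module_total} per module")
print(f"about {utilization:.0%} of the assumed multiplier count")
```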
FPGAs can be used to address data flow
requirements  that  persist  in  the  system  until 
application  of  adaptive  beamforming  weights 

♦  Digital  Pulse  Compression 

•  Fast  convolution  with  FFT  IP  cores 

♦  Doppler  Processing 

•  FPGA  FFT  IP  cores  available 

♦  Adaptive  Beamforming  Weight  Application 

•  Similar  advantages  to  those  in  sub-array  beamformer 

FPGAs  can  augment  weight  computation 

♦  QR  Decomposition 

•  New  FPGA  solutions  may  replace  microprocessors 

♦  Cholesky  Decomposition 

•  Possibly  form  covariance  matrix  in  adjunct  FPGA 
FFT IP cores can be used to implement pulse compression

♦ 8192-point FFT @ 25 MSps/channel

♦ 6 sub-array channels / FPGA

♦ 3-stage pipelined convolver -- 2 convolvers / FPGA

♦  Enough  resources  to  sum  partial  products  from  beamformer 
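
Pulse compression by fast convolution follows a three-stage pattern: forward FFT, multiplication by the reference spectrum, inverse FFT. The sketch below shows that pattern at toy size with a pure-Python radix-2 FFT and a 5-chip Barker code. Both are illustrative choices, not the 8192-point FFT cores or the waveform used in the slides.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fast_convolve(x, h):
    """Linear convolution via FFT -> multiply -> inverse FFT with zero padding."""
    n_out = len(x) + len(h) - 1
    n_fft = 1
    while n_fft < n_out:
        n_fft *= 2
    spec_x = fft(list(x) + [0] * (n_fft - len(x)))
    spec_h = fft(list(h) + [0] * (n_fft - len(h)))
    y = fft([a * b for a, b in zip(spec_x, spec_h)], inverse=True)
    return [v / n_fft for v in y[:n_out]]


def direct_convolve(x, h):
    """Reference time-domain convolution."""
    y = [0j] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y


pulse = [1, 1, 1, -1, 1]                                  # Barker-5 code (illustrative)
received = [0, 0, 0] + pulse + [0, 0]                     # echo delayed by 3 samples
matched = [complex(p).conjugate() for p in reversed(pulse)]

compressed = fast_convolve(received, matched)
reference = direct_convolve(received, matched)
peak_index = max(range(len(compressed)), key=lambda n: abs(compressed[n]))
print("compressed peak at sample", peak_index)
```

The peak lands at the echo delay plus the pulse length minus one, and the FFT-based result matches the direct convolution to rounding error.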
DIGITAL PULSE COMPRESSION FPGA UTILIZATION
[Chart: utilization of Multipliers, Block RAM, and Logic Slices, split into Processing, I/O, Memory Ctr, and Margin]

FFT  cores  tend  to  be  BRAM  hungry. 
Unconstrained Linear Architecture

♦  All  input  channels  contribute  to  each  output 

Constrained  Linear  Architecture 

♦  A  subset  of  input  channels  contributes  to  any  output 

Mesh  Architecture 

♦  All  input  channels  contribute  to  each  output 
Number of Output Channels


•  Basic  limits  are  imposed  by  I/O  and  number  of  multipliers 

•  Inputs over 18 bits can increase the number of multipliers

♦  Keep  watch  on  bit  growth  in  front-end  processing 


I/O and Multiplier Constraints for Virtex-II Pro 100
Multiplexing must be designed to maximize communication

♦  Beam  Partitioned  output  multiplexing  may  reduce  efficiency 

♦  Alternate  multiplexing  methods  may  be  necessary 
Data can also be partitioned by link: each link carries an integral number of channels
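
The cost of keeping an integral number of channels on each link can be estimated with simple arithmetic. The sketch below uses the 240 MBps per-direction link rate from the slides and assumes a post-decimation channel of 25 MSps complex data with 16-bit I and 16-bit Q (100 MB/s). That sample format is an assumption, chosen because it reproduces the 10-channel and 12-channel labels that appear beside groups of 5 links in the mesh figure.

```python
LINK_MBYTES_PER_S = 240       # per-direction transceiver rate quoted in the slides
CHANNEL_MBYTES_PER_S = 100    # assumption: 25 MSps complex, 16-bit I + 16-bit Q


def channels_partitioned_by_link(n_links):
    """Each link carries a whole number of channels; leftover link bandwidth is unused."""
    return n_links * (LINK_MBYTES_PER_S // CHANNEL_MBYTES_PER_S)


def channels_multiplexed_across_links(n_links):
    """Channels are spread over the pooled bandwidth of all links in the group."""
    return (n_links * LINK_MBYTES_PER_S) // CHANNEL_MBYTES_PER_S


by_link = channels_partitioned_by_link(5)
pooled = channels_multiplexed_across_links(5)
print(f"5 links: {by_link} channels partitioned by link, {pooled} when multiplexed")
```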
Full MxN unconstrained complex matrix multiply
Outputs  only  from  a  single  module 
Processing  throughput  limited  by  beamformer  module  I/O 
Communication  latency  across  beamformer  is  an  issue 
Additional  beams  can  be  produced  by  multiple  passes  on  data 

♦  Decreases  overall  radar  duty  cycle 

♦  Memory  should  be  located  in  digital  beamformer  to  save  I/O  bandwidth 

♦  Increased  beamformer  processing  speed  may  be  required 
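
As a reference model, the unconstrained beamformer is a complex matrix-vector product applied to each input set, and additional beams can be formed by re-running stored data through a different block of weight rows. The sketch below is a plain-Python illustration with hypothetical function names and a toy 2 x 3 weight matrix. It says nothing about how the computation is distributed across modules.

```python
def beamform(weights, inputs):
    """y = H x: each of M beams is a weighted sum of all N input channels."""
    return [sum(h * x for h, x in zip(row, inputs)) for row in weights]


def beamform_multipass(weights, inputs, beams_per_pass):
    """Form beams in several passes over the same stored input set."""
    beams = []
    for start in range(0, len(weights), beams_per_pass):
        beams.extend(beamform(weights[start:start + beams_per_pass], inputs))
    return beams


def complex_macs(weights):
    """Complex multiply-accumulates per input set for an M x N weight matrix."""
    return len(weights) * len(weights[0])


H = [[1 + 1j, 2, -1j],
     [0, 1j, 3]]
x = [1, 1j, 2 - 1j]

single_pass = beamform(H, x)
two_passes = beamform_multipass(H, x, beams_per_pass=1)
print(single_pass, complex_macs(H))
```

Multiple passes give the same beams as a single pass at the cost of time, which is why the slides note a lower radar duty cycle or a need for faster beamformer processing.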


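Beams (Y) = beamforming matrix (H) x input sets (X):

\[
\begin{bmatrix} Y_1 \\ \vdots \\ Y_M \end{bmatrix}
=
\begin{bmatrix}
H_{11} & H_{12} & \cdots & H_{1(N-1)} & H_{1N} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
H_{M1} & H_{M2} & \cdots & H_{M(N-1)} & H_{MN}
\end{bmatrix}
\begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix}
\]

[Figure: chain of processor (PROC) modules; additional beams are produced by multiple passes over the input sets]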




Unconstrained linear beamformer module is I/O bound

♦ Total number of input links plus output links is constant

♦ Choice of input to output balance affects utilization

[Plot: Number of Output Channels versus Number of Input Channels]

Note: adding non-MGT connections could potentially increase throughput
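
A toy model shows why the input-to-output balance matters when the total link count is fixed. It assumes 20 transceivers per module (from the slides), a hypothetical 2 channels per link in each direction, and that every link is used either for input or for output. Real modules also pass data through to neighbors, so this is only a sketch of the trend.

```python
TOTAL_LINKS = 20          # transceivers per FPGA, from the slides
CHANNELS_PER_LINK = 2     # assumption for illustration only


def cmac_load(input_links):
    """Complex MACs per input set (outputs x inputs) for a given link split."""
    output_links = TOTAL_LINKS - input_links
    n_inputs = input_links * CHANNELS_PER_LINK
    m_outputs = output_links * CHANNELS_PER_LINK
    return m_outputs * n_inputs


loads = {k: cmac_load(k) for k in range(1, TOTAL_LINKS)}
best_split = max(loads, key=loads.get)

print("input links giving the largest M x N product:", best_split)
print("complex MACs per input set at that split:", loads[best_split])
```

In this model the computation a module can be fed is largest for an even split and falls off as the split becomes lopsided, so an unbalanced design leaves multipliers idle.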
I/O and compute bounds are not close -- low utilization
36  x  96  unconstrained  matrix  multiply 

35  modules  required  for  FPGA  digital  processor 

♦  Front-end  -  5  modules 

♦  Small-array  beamformer  -  24  modules 

♦ Digital pulse compression - 6 modules
I/O and compute bounds are close -- good utilization
40  x  100  unconstrained  matrix  multiply 

22  modules  required  for  FPGA  digital  processor 

♦  Front-end  -  5  modules 

♦  Small-array  beamformer  -  10  modules 

♦  Digital  pulse  compression  -  7  modules 


Use  each  beamformer  module  to  produce  outputs 
MxN  constrained  complex  matrix  multiply 

♦  Use  only  a  subset  of  inputs  for  each  output 

I/O and computation bounds are the same as in the unconstrained case

♦  Inputs  and  outputs  must  be  balanced  to  maximize  utilization 


EXPLICIT  ZEROS  IN  BEAMFORMING  MATRIX 
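
The effect of explicit zeros can be seen in a small reference model: a beam whose weight for a channel is zero needs neither that channel's data nor a multiply for it. The sketch below uses a hypothetical 3 x 6 block-structured weight matrix. The actual sparsity pattern of the constrained design is not given in the slides.

```python
def constrained_beamform(weights, inputs):
    """Matrix-vector product that skips explicit zeros and counts the complex MACs used."""
    beams, macs = [], 0
    for row in weights:
        acc = 0
        for h, x in zip(row, inputs):
            if h != 0:                 # zero weight: channel not routed to this beam
                acc += h * x
                macs += 1
        beams.append(acc)
    return beams, macs


# Hypothetical constraint pattern: each beam uses only two of the six inputs.
H = [[1, 2, 0, 0, 0, 0],
     [0, 0, 1j, 1, 0, 0],
     [0, 0, 0, 0, 2, -1]]
x = [1, 2, 3, 4, 5, 6]

beams, macs_used = constrained_beamform(H, x)
full_macs = len(H) * len(H[0])
print(beams, f"{macs_used} of {full_macs} complex MACs")
```

Because each output touches fewer inputs, both the I/O into a module and its multiplier load drop, which is what allows more outputs per module.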
•  Adding matrix constraints increases the number of outputs
•  50 x 100 constrained matrix multiply

•  19 modules required for FPGA digital processor

♦  Front-end  -  5  modules 

♦  Small-array  beamformer  -  5  modules 

♦  Digital  pulse  compression  -  9  modules 
Mesh architecture offers utilization enhancement

♦  I/O  and  computation  bounds  touch 

Full  unconstrained  matrix  multiply 

Partially  formed  beams  sent  forward  for  summing  in  DPC 


[Figure: mesh of beamformer modules; each connection is a group of 5 links carrying 10 or 12 channels]

NO EXPLICIT ZEROS IN BEAMFORMING MATRIX

\[
\begin{bmatrix} Y_1 \\ \vdots \\ Y_M \end{bmatrix}
=
\begin{bmatrix}
H_{11} & H_{12} & \cdots & H_{1(N-1)} & H_{1N} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
H_{M1} & H_{M2} & \cdots & H_{M(N-1)} & H_{MN}
\end{bmatrix}
\begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix}
\]

40 x 48 CMAC using 4 modules


Note:  Computation  limit  normalized  for  architecture 




I/O  and  compute  bounds  touch  --  high  utilization 
40  x  96  unconstrained  matrix  multiply 
20  modules  required  for  FPGA  digital  processor 

♦  Front-end  -  5  modules 

♦  Small-array  beamformer  -  8  modules 

♦  Digital  pulse  compression  -  7  modules 
Mesh architecture gives highest multiplier utilization
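
One way to compare the unconstrained designs is the number of complex multiply-accumulates each beamformer module performs per input set, using only the matrix sizes and beamformer module counts quoted above. The constrained design is left out because the number of nonzero weights is not stated. This is a rough proxy for multiplier utilization, not the metric plotted in the slides.

```python
# name: (outputs M, inputs N, beamformer modules), all taken from the slides
designs = {
    "linear 36 x 96": (36, 96, 24),
    "linear 40 x 100": (40, 100, 10),
    "mesh 40 x 96": (40, 96, 8),
}

cmacs_per_module = {name: m * n / modules for name, (m, n, modules) in designs.items()}
busiest = max(cmacs_per_module, key=cmacs_per_module.get)

for name, load in cmacs_per_module.items():
    print(f"{name}: {load:.0f} complex MACs per module per input set")
print("highest per-module load:", busiest)
```

The mesh figure of 480 per module matches the "40 x 48 CMAC using 4 modules" label on the mesh diagram.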
Large systems can be created through layering beamformers

♦  8  beam  system,  20  channels  per  beam  -- 160  channels 

♦  160  x  96  unconstrained  matrix  multiply 

65  modules  required  for  FPGA  digital  processor 

♦  Front-end  -  5  modules 

♦  Small-array  beamformer  -  32  modules 

♦  Digital  pulse  compression  -  28  modules 
FPGAs can provide efficient I/O and computational power to
address  high  input  bandwidths  of  modern  radar  systems. 

♦  Front-end  processing 

♦  Sub-array  beamformer 

♦  Digital  pulse  compression 

♦  Adaptive  beamforming 

System  topologies  that  provide  efficient  utilization  of 
computational  and  I/O  resources  change  dramatically  as  system 
requirements  scale. 

♦  Watch  I/O  and  computation  bounds 

Small changes in system requirements can dramatically
increase complexity of FPGA implementations when
computational bounds of embedded resources are exceeded.

♦  Watch  for  symmetries  in  filters 

♦  Watch  bit  growth  before  18-bit  multipliers 

FPGAs  should  be  used  until  application  of  adaptive 
beamforming  weights  due  to  high  bandwidth  dataflow. 